When we naively calculate the average value of the total order amount, we assume that there will be no outliers in the records. This is the reason that we initially get an extremely high average order value of $3145.13 as the outliers are inflating the average values.
Moreover, calculating mean is not a good idea, when we do not know if there are any high value transactions in the dataset, at such times median of the amount provides us with a realisitc value which is closer to the actual mean.
If we absolutely need to calculate the mean or average value, then we mandatorily need to remove the outliers first.
As well as we need to check if there are any null values or missing values in the records that might affect our average value.
As each user can buy more than one item, which can also inflate the order amount for a few transactions, it is a much better idea to first calculate the per unit price for each order id.
This approach will help us to see if there are really any outliers in the dataset.
If we were to report any metrics on the dataset before any analysis was conducted, I would report the median of the order value.
Only after throughly handling the outliers in the dataset, I would report the mean or average order value.
Before removing outliers the MEDIAN value of the order amount is $284
After removing the outliers the AVERAGE order value is $303.85
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
df = pd.read_csv('2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv')
df
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 73 | 993 | 330 | 2 | debit | 2017-03-30 13:47:17 |
| 4996 | 4997 | 48 | 789 | 234 | 2 | cash | 2017-03-16 20:36:16 |
| 4997 | 4998 | 56 | 867 | 351 | 3 | cash | 2017-03-19 5:42:42 |
| 4998 | 4999 | 60 | 825 | 354 | 2 | credit_card | 2017-03-16 14:51:18 |
| 4999 | 5000 | 44 | 734 | 288 | 2 | debit | 2017-03-18 15:48:18 |
5000 rows × 7 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 5000 non-null int64 1 shop_id 5000 non-null int64 2 user_id 5000 non-null int64 3 order_amount 5000 non-null int64 4 total_items 5000 non-null int64 5 payment_method 5000 non-null object 6 created_at 5000 non-null object dtypes: int64(5), object(2) memory usage: 273.6+ KB
df.isnull().sum()
order_id 0 shop_id 0 user_id 0 order_amount 0 total_items 0 payment_method 0 created_at 0 dtype: int64
df.nunique()
order_id 5000 shop_id 100 user_id 301 order_amount 258 total_items 8 payment_method 3 created_at 4991 dtype: int64
df['created_at'] = pd.to_datetime(df['created_at'])
all_dates = (pd.to_datetime(df['created_at'])).apply(lambda x:x.date())
print('Total number of days of transaction: {}'.format(((all_dates.max() - all_dates.min()).days+1)))
print('Start date: {}'.format(all_dates.min()))
print('End date: {}'.format(all_dates.max()))
Total number of days of transaction: 30 Start date: 2017-03-01 End date: 2017-03-30
df['unit_price'] = df['order_amount'] / df['total_items']
df_corrected = df.copy()
df
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 | 112.0 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 | 90.0 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 04:23:56 | 144.0 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 | 156.0 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 04:35:11 | 156.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 73 | 993 | 330 | 2 | debit | 2017-03-30 13:47:17 | 165.0 |
| 4996 | 4997 | 48 | 789 | 234 | 2 | cash | 2017-03-16 20:36:16 | 117.0 |
| 4997 | 4998 | 56 | 867 | 351 | 3 | cash | 2017-03-19 05:42:42 | 117.0 |
| 4998 | 4999 | 60 | 825 | 354 | 2 | credit_card | 2017-03-16 14:51:18 | 177.0 |
| 4999 | 5000 | 44 | 734 | 288 | 2 | debit | 2017-03-18 15:48:18 | 144.0 |
5000 rows × 8 columns
df.sort_values(by=['unit_price'],ascending=False)
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 4420 | 4421 | 78 | 969 | 77175 | 3 | debit | 2017-03-09 15:21:35 | 25725.0 |
| 2906 | 2907 | 78 | 817 | 77175 | 3 | debit | 2017-03-16 03:45:46 | 25725.0 |
| 4505 | 4506 | 78 | 866 | 25725 | 1 | debit | 2017-03-22 22:06:01 | 25725.0 |
| 493 | 494 | 78 | 983 | 51450 | 2 | cash | 2017-03-16 21:39:35 | 25725.0 |
| 2452 | 2453 | 78 | 709 | 51450 | 2 | cash | 2017-03-27 11:04:04 | 25725.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2714 | 2715 | 92 | 905 | 180 | 2 | cash | 2017-03-15 17:02:28 | 90.0 |
| 3915 | 3916 | 92 | 760 | 180 | 2 | cash | 2017-03-26 15:44:40 | 90.0 |
| 4760 | 4761 | 92 | 937 | 90 | 1 | debit | 2017-03-20 07:37:28 | 90.0 |
| 2128 | 2129 | 92 | 729 | 180 | 2 | credit_card | 2017-03-03 17:26:32 | 90.0 |
| 1927 | 1928 | 92 | 960 | 270 | 3 | debit | 2017-03-21 03:44:08 | 90.0 |
5000 rows × 8 columns
df['order_amount'].describe()
count 5000.000000 mean 3145.128000 std 41282.539349 min 90.000000 25% 163.000000 50% 284.000000 75% 390.000000 max 704000.000000 Name: order_amount, dtype: float64
df['order_amount'].median()
284.0
shop_mean = pd.DataFrame(df.groupby('shop_id')['unit_price'].mean().sort_values(ascending=False))
shop_mean.reset_index(inplace=True)
shop_mean
| shop_id | unit_price | |
|---|---|---|
| 0 | 78 | 25725.0 |
| 1 | 42 | 352.0 |
| 2 | 12 | 201.0 |
| 3 | 89 | 196.0 |
| 4 | 99 | 195.0 |
| ... | ... | ... |
| 95 | 53 | 112.0 |
| 96 | 100 | 111.0 |
| 97 | 32 | 101.0 |
| 98 | 2 | 94.0 |
| 99 | 92 | 90.0 |
100 rows × 2 columns
plt.figure(figsize=(16,4), dpi=200)
sns.lineplot(x=shop_mean['shop_id'],y=shop_mean['unit_price'])
<AxesSubplot:xlabel='shop_id', ylabel='unit_price'>
fig = px.scatter(
shop_mean,
x=shop_mean['shop_id'],
y=shop_mean['unit_price'],
width=950,
height=350,
color=shop_mean['unit_price']
)
fig.update_layout(
xaxis_title='Shop ID',
yaxis_title='Unit Price',
title='Anomaly Detection'
)
fig.show()
df[df['shop_id'] == 78]
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 160 | 161 | 78 | 990 | 25725 | 1 | credit_card | 2017-03-12 05:56:57 | 25725.0 |
| 490 | 491 | 78 | 936 | 51450 | 2 | debit | 2017-03-26 17:08:19 | 25725.0 |
| 493 | 494 | 78 | 983 | 51450 | 2 | cash | 2017-03-16 21:39:35 | 25725.0 |
| 511 | 512 | 78 | 967 | 51450 | 2 | cash | 2017-03-09 07:23:14 | 25725.0 |
| 617 | 618 | 78 | 760 | 51450 | 2 | cash | 2017-03-18 11:18:42 | 25725.0 |
| 691 | 692 | 78 | 878 | 154350 | 6 | debit | 2017-03-27 22:51:43 | 25725.0 |
| 1056 | 1057 | 78 | 800 | 25725 | 1 | debit | 2017-03-15 10:16:45 | 25725.0 |
| 1193 | 1194 | 78 | 944 | 25725 | 1 | debit | 2017-03-16 16:38:26 | 25725.0 |
| 1204 | 1205 | 78 | 970 | 25725 | 1 | credit_card | 2017-03-17 22:32:21 | 25725.0 |
| 1259 | 1260 | 78 | 775 | 77175 | 3 | credit_card | 2017-03-27 09:27:20 | 25725.0 |
| 1384 | 1385 | 78 | 867 | 25725 | 1 | cash | 2017-03-17 16:38:06 | 25725.0 |
| 1419 | 1420 | 78 | 912 | 25725 | 1 | cash | 2017-03-30 12:23:43 | 25725.0 |
| 1452 | 1453 | 78 | 812 | 25725 | 1 | credit_card | 2017-03-17 18:09:54 | 25725.0 |
| 1529 | 1530 | 78 | 810 | 51450 | 2 | cash | 2017-03-29 07:12:01 | 25725.0 |
| 2270 | 2271 | 78 | 855 | 25725 | 1 | credit_card | 2017-03-14 23:58:22 | 25725.0 |
| 2452 | 2453 | 78 | 709 | 51450 | 2 | cash | 2017-03-27 11:04:04 | 25725.0 |
| 2492 | 2493 | 78 | 834 | 102900 | 4 | debit | 2017-03-04 04:37:34 | 25725.0 |
| 2495 | 2496 | 78 | 707 | 51450 | 2 | cash | 2017-03-26 04:38:52 | 25725.0 |
| 2512 | 2513 | 78 | 935 | 51450 | 2 | debit | 2017-03-18 18:57:13 | 25725.0 |
| 2548 | 2549 | 78 | 861 | 25725 | 1 | cash | 2017-03-17 19:36:00 | 25725.0 |
| 2564 | 2565 | 78 | 915 | 77175 | 3 | debit | 2017-03-25 01:19:35 | 25725.0 |
| 2690 | 2691 | 78 | 962 | 77175 | 3 | debit | 2017-03-22 07:33:25 | 25725.0 |
| 2773 | 2774 | 78 | 890 | 25725 | 1 | cash | 2017-03-26 10:36:43 | 25725.0 |
| 2818 | 2819 | 78 | 869 | 51450 | 2 | debit | 2017-03-17 06:25:51 | 25725.0 |
| 2821 | 2822 | 78 | 814 | 51450 | 2 | cash | 2017-03-02 17:13:25 | 25725.0 |
| 2906 | 2907 | 78 | 817 | 77175 | 3 | debit | 2017-03-16 03:45:46 | 25725.0 |
| 2922 | 2923 | 78 | 740 | 25725 | 1 | debit | 2017-03-12 20:10:58 | 25725.0 |
| 3085 | 3086 | 78 | 910 | 25725 | 1 | cash | 2017-03-26 01:59:27 | 25725.0 |
| 3101 | 3102 | 78 | 855 | 51450 | 2 | credit_card | 2017-03-21 05:10:34 | 25725.0 |
| 3151 | 3152 | 78 | 745 | 25725 | 1 | credit_card | 2017-03-18 13:13:07 | 25725.0 |
| 3167 | 3168 | 78 | 927 | 51450 | 2 | cash | 2017-03-12 12:23:08 | 25725.0 |
| 3403 | 3404 | 78 | 928 | 77175 | 3 | debit | 2017-03-16 09:45:05 | 25725.0 |
| 3440 | 3441 | 78 | 982 | 25725 | 1 | debit | 2017-03-19 19:02:54 | 25725.0 |
| 3705 | 3706 | 78 | 828 | 51450 | 2 | credit_card | 2017-03-14 20:43:15 | 25725.0 |
| 3724 | 3725 | 78 | 766 | 77175 | 3 | credit_card | 2017-03-16 14:13:26 | 25725.0 |
| 3780 | 3781 | 78 | 889 | 25725 | 1 | cash | 2017-03-11 21:14:50 | 25725.0 |
| 4040 | 4041 | 78 | 852 | 25725 | 1 | cash | 2017-03-02 14:31:12 | 25725.0 |
| 4079 | 4080 | 78 | 946 | 51450 | 2 | cash | 2017-03-20 21:14:00 | 25725.0 |
| 4192 | 4193 | 78 | 787 | 77175 | 3 | credit_card | 2017-03-18 09:25:32 | 25725.0 |
| 4311 | 4312 | 78 | 960 | 51450 | 2 | debit | 2017-03-01 03:02:10 | 25725.0 |
| 4412 | 4413 | 78 | 756 | 51450 | 2 | debit | 2017-03-02 04:13:39 | 25725.0 |
| 4420 | 4421 | 78 | 969 | 77175 | 3 | debit | 2017-03-09 15:21:35 | 25725.0 |
| 4505 | 4506 | 78 | 866 | 25725 | 1 | debit | 2017-03-22 22:06:01 | 25725.0 |
| 4584 | 4585 | 78 | 997 | 25725 | 1 | cash | 2017-03-25 21:48:44 | 25725.0 |
| 4715 | 4716 | 78 | 818 | 77175 | 3 | debit | 2017-03-05 05:10:44 | 25725.0 |
| 4918 | 4919 | 78 | 823 | 25725 | 1 | cash | 2017-03-15 13:26:46 | 25725.0 |
df['unit_price'].median()
153.0
df_corrected.loc[df_corrected['shop_id'] == 78, ['unit_price']] = df['unit_price'].median()
df_corrected.loc[df_corrected['shop_id'] == 78, ['order_amount']] = df_corrected.loc[df_corrected['shop_id'] == 78,
['total_items']].values * df['unit_price'].median()
df_corrected.loc[df_corrected['shop_id'] == 78]
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 160 | 161 | 78 | 990 | 153.0 | 1 | credit_card | 2017-03-12 05:56:57 | 153.0 |
| 490 | 491 | 78 | 936 | 306.0 | 2 | debit | 2017-03-26 17:08:19 | 153.0 |
| 493 | 494 | 78 | 983 | 306.0 | 2 | cash | 2017-03-16 21:39:35 | 153.0 |
| 511 | 512 | 78 | 967 | 306.0 | 2 | cash | 2017-03-09 07:23:14 | 153.0 |
| 617 | 618 | 78 | 760 | 306.0 | 2 | cash | 2017-03-18 11:18:42 | 153.0 |
| 691 | 692 | 78 | 878 | 918.0 | 6 | debit | 2017-03-27 22:51:43 | 153.0 |
| 1056 | 1057 | 78 | 800 | 153.0 | 1 | debit | 2017-03-15 10:16:45 | 153.0 |
| 1193 | 1194 | 78 | 944 | 153.0 | 1 | debit | 2017-03-16 16:38:26 | 153.0 |
| 1204 | 1205 | 78 | 970 | 153.0 | 1 | credit_card | 2017-03-17 22:32:21 | 153.0 |
| 1259 | 1260 | 78 | 775 | 459.0 | 3 | credit_card | 2017-03-27 09:27:20 | 153.0 |
| 1384 | 1385 | 78 | 867 | 153.0 | 1 | cash | 2017-03-17 16:38:06 | 153.0 |
| 1419 | 1420 | 78 | 912 | 153.0 | 1 | cash | 2017-03-30 12:23:43 | 153.0 |
| 1452 | 1453 | 78 | 812 | 153.0 | 1 | credit_card | 2017-03-17 18:09:54 | 153.0 |
| 1529 | 1530 | 78 | 810 | 306.0 | 2 | cash | 2017-03-29 07:12:01 | 153.0 |
| 2270 | 2271 | 78 | 855 | 153.0 | 1 | credit_card | 2017-03-14 23:58:22 | 153.0 |
| 2452 | 2453 | 78 | 709 | 306.0 | 2 | cash | 2017-03-27 11:04:04 | 153.0 |
| 2492 | 2493 | 78 | 834 | 612.0 | 4 | debit | 2017-03-04 04:37:34 | 153.0 |
| 2495 | 2496 | 78 | 707 | 306.0 | 2 | cash | 2017-03-26 04:38:52 | 153.0 |
| 2512 | 2513 | 78 | 935 | 306.0 | 2 | debit | 2017-03-18 18:57:13 | 153.0 |
| 2548 | 2549 | 78 | 861 | 153.0 | 1 | cash | 2017-03-17 19:36:00 | 153.0 |
| 2564 | 2565 | 78 | 915 | 459.0 | 3 | debit | 2017-03-25 01:19:35 | 153.0 |
| 2690 | 2691 | 78 | 962 | 459.0 | 3 | debit | 2017-03-22 07:33:25 | 153.0 |
| 2773 | 2774 | 78 | 890 | 153.0 | 1 | cash | 2017-03-26 10:36:43 | 153.0 |
| 2818 | 2819 | 78 | 869 | 306.0 | 2 | debit | 2017-03-17 06:25:51 | 153.0 |
| 2821 | 2822 | 78 | 814 | 306.0 | 2 | cash | 2017-03-02 17:13:25 | 153.0 |
| 2906 | 2907 | 78 | 817 | 459.0 | 3 | debit | 2017-03-16 03:45:46 | 153.0 |
| 2922 | 2923 | 78 | 740 | 153.0 | 1 | debit | 2017-03-12 20:10:58 | 153.0 |
| 3085 | 3086 | 78 | 910 | 153.0 | 1 | cash | 2017-03-26 01:59:27 | 153.0 |
| 3101 | 3102 | 78 | 855 | 306.0 | 2 | credit_card | 2017-03-21 05:10:34 | 153.0 |
| 3151 | 3152 | 78 | 745 | 153.0 | 1 | credit_card | 2017-03-18 13:13:07 | 153.0 |
| 3167 | 3168 | 78 | 927 | 306.0 | 2 | cash | 2017-03-12 12:23:08 | 153.0 |
| 3403 | 3404 | 78 | 928 | 459.0 | 3 | debit | 2017-03-16 09:45:05 | 153.0 |
| 3440 | 3441 | 78 | 982 | 153.0 | 1 | debit | 2017-03-19 19:02:54 | 153.0 |
| 3705 | 3706 | 78 | 828 | 306.0 | 2 | credit_card | 2017-03-14 20:43:15 | 153.0 |
| 3724 | 3725 | 78 | 766 | 459.0 | 3 | credit_card | 2017-03-16 14:13:26 | 153.0 |
| 3780 | 3781 | 78 | 889 | 153.0 | 1 | cash | 2017-03-11 21:14:50 | 153.0 |
| 4040 | 4041 | 78 | 852 | 153.0 | 1 | cash | 2017-03-02 14:31:12 | 153.0 |
| 4079 | 4080 | 78 | 946 | 306.0 | 2 | cash | 2017-03-20 21:14:00 | 153.0 |
| 4192 | 4193 | 78 | 787 | 459.0 | 3 | credit_card | 2017-03-18 09:25:32 | 153.0 |
| 4311 | 4312 | 78 | 960 | 306.0 | 2 | debit | 2017-03-01 03:02:10 | 153.0 |
| 4412 | 4413 | 78 | 756 | 306.0 | 2 | debit | 2017-03-02 04:13:39 | 153.0 |
| 4420 | 4421 | 78 | 969 | 459.0 | 3 | debit | 2017-03-09 15:21:35 | 153.0 |
| 4505 | 4506 | 78 | 866 | 153.0 | 1 | debit | 2017-03-22 22:06:01 | 153.0 |
| 4584 | 4585 | 78 | 997 | 153.0 | 1 | cash | 2017-03-25 21:48:44 | 153.0 |
| 4715 | 4716 | 78 | 818 | 459.0 | 3 | debit | 2017-03-05 05:10:44 | 153.0 |
| 4918 | 4919 | 78 | 823 | 153.0 | 1 | cash | 2017-03-15 13:26:46 | 153.0 |
df[df['shop_id'] == 42]
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 04:00:00 | 352.0 |
| 40 | 41 | 42 | 793 | 352 | 1 | credit_card | 2017-03-24 14:15:41 | 352.0 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 04:00:00 | 352.0 |
| 308 | 309 | 42 | 770 | 352 | 1 | credit_card | 2017-03-11 18:14:39 | 352.0 |
| 409 | 410 | 42 | 904 | 704 | 2 | credit_card | 2017-03-04 14:32:58 | 352.0 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 04:00:00 | 352.0 |
| 834 | 835 | 42 | 792 | 352 | 1 | cash | 2017-03-25 21:31:25 | 352.0 |
| 835 | 836 | 42 | 819 | 704 | 2 | cash | 2017-03-09 14:15:15 | 352.0 |
| 938 | 939 | 42 | 808 | 1056 | 3 | credit_card | 2017-03-13 23:43:45 | 352.0 |
| 979 | 980 | 42 | 744 | 352 | 1 | debit | 2017-03-12 13:09:04 | 352.0 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 04:00:00 | 352.0 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 04:00:00 | 352.0 |
| 1364 | 1365 | 42 | 797 | 1760 | 5 | cash | 2017-03-10 06:28:21 | 352.0 |
| 1367 | 1368 | 42 | 926 | 1408 | 4 | cash | 2017-03-13 02:38:34 | 352.0 |
| 1436 | 1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 04:00:00 | 352.0 |
| 1471 | 1472 | 42 | 907 | 1408 | 4 | debit | 2017-03-12 23:00:22 | 352.0 |
| 1512 | 1513 | 42 | 946 | 352 | 1 | debit | 2017-03-24 13:35:04 | 352.0 |
| 1520 | 1521 | 42 | 756 | 704 | 2 | debit | 2017-03-22 13:10:31 | 352.0 |
| 1562 | 1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 04:00:00 | 352.0 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 04:00:00 | 352.0 |
| 1911 | 1912 | 42 | 739 | 704 | 2 | cash | 2017-03-07 05:42:52 | 352.0 |
| 1929 | 1930 | 42 | 770 | 352 | 1 | credit_card | 2017-03-17 08:11:13 | 352.0 |
| 2003 | 2004 | 42 | 934 | 704 | 2 | cash | 2017-03-26 09:21:26 | 352.0 |
| 2018 | 2019 | 42 | 739 | 352 | 1 | debit | 2017-03-01 12:42:26 | 352.0 |
| 2053 | 2054 | 42 | 951 | 352 | 1 | debit | 2017-03-19 11:49:12 | 352.0 |
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 04:00:00 | 352.0 |
| 2273 | 2274 | 42 | 747 | 704 | 2 | debit | 2017-03-27 20:48:19 | 352.0 |
| 2297 | 2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 04:00:00 | 352.0 |
| 2491 | 2492 | 42 | 868 | 704 | 2 | debit | 2017-03-01 18:33:33 | 352.0 |
| 2609 | 2610 | 42 | 868 | 704 | 2 | debit | 2017-03-23 18:10:14 | 352.0 |
| 2766 | 2767 | 42 | 970 | 704 | 2 | credit_card | 2017-03-05 10:45:42 | 352.0 |
| 2835 | 2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 2969 | 2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 2987 | 2988 | 42 | 819 | 1056 | 3 | cash | 2017-03-03 09:09:25 | 352.0 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 04:00:00 | 352.0 |
| 3513 | 3514 | 42 | 726 | 1056 | 3 | debit | 2017-03-24 17:51:05 | 352.0 |
| 3651 | 3652 | 42 | 830 | 352 | 1 | credit_card | 2017-03-24 22:26:58 | 352.0 |
| 3697 | 3698 | 42 | 839 | 352 | 1 | debit | 2017-03-12 02:45:09 | 352.0 |
| 3903 | 3904 | 42 | 975 | 352 | 1 | debit | 2017-03-12 01:28:31 | 352.0 |
| 3998 | 3999 | 42 | 886 | 352 | 1 | debit | 2017-03-09 20:10:41 | 352.0 |
| 4056 | 4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 4231 | 4232 | 42 | 962 | 352 | 1 | cash | 2017-03-04 00:01:19 | 352.0 |
| 4294 | 4295 | 42 | 859 | 704 | 2 | cash | 2017-03-24 20:50:40 | 352.0 |
| 4326 | 4327 | 42 | 788 | 704 | 2 | debit | 2017-03-16 23:37:57 | 352.0 |
| 4421 | 4422 | 42 | 736 | 704 | 2 | credit_card | 2017-03-01 12:19:49 | 352.0 |
| 4625 | 4626 | 42 | 809 | 352 | 1 | credit_card | 2017-03-11 08:21:26 | 352.0 |
| 4646 | 4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 04:00:00 | 352.0 |
| 4745 | 4746 | 42 | 872 | 352 | 1 | debit | 2017-03-24 00:57:24 | 352.0 |
| 4767 | 4768 | 42 | 720 | 704 | 2 | credit_card | 2017-03-14 10:26:08 | 352.0 |
| 4868 | 4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 04:00:00 | 352.0 |
| 4882 | 4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 04:00:00 | 352.0 |
df[(df['shop_id'] == 42) & (df['user_id'] == 607)]
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 04:00:00 | 352.0 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 04:00:00 | 352.0 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 04:00:00 | 352.0 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 04:00:00 | 352.0 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 04:00:00 | 352.0 |
| 1436 | 1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 04:00:00 | 352.0 |
| 1562 | 1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 04:00:00 | 352.0 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 04:00:00 | 352.0 |
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 04:00:00 | 352.0 |
| 2297 | 2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 04:00:00 | 352.0 |
| 2835 | 2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 2969 | 2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 04:00:00 | 352.0 |
| 4056 | 4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 4646 | 4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 04:00:00 | 352.0 |
| 4868 | 4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 04:00:00 | 352.0 |
| 4882 | 4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 04:00:00 | 352.0 |
df['total_items'].median()
2.0
df_corrected.loc[(df_corrected['shop_id'] == 42) &
(df_corrected['user_id'] == 607), ['total_items']] = df['total_items'].median()
df_corrected.loc[(df_corrected['shop_id'] == 42) &
(df_corrected['user_id'] == 607), ['order_amount']] = df_corrected.loc[(df_corrected['shop_id'] == 42) &
(df_corrected['user_id'] == 607), ['unit_price']].values * df['total_items'].median()
df_corrected[(df_corrected['shop_id'] == 42) & (df_corrected['user_id'] == 607)]
| order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | unit_price | |
|---|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-07 04:00:00 | 352.0 |
| 60 | 61 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-04 04:00:00 | 352.0 |
| 520 | 521 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-02 04:00:00 | 352.0 |
| 1104 | 1105 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-24 04:00:00 | 352.0 |
| 1362 | 1363 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-15 04:00:00 | 352.0 |
| 1436 | 1437 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-11 04:00:00 | 352.0 |
| 1562 | 1563 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-19 04:00:00 | 352.0 |
| 1602 | 1603 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-17 04:00:00 | 352.0 |
| 2153 | 2154 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-12 04:00:00 | 352.0 |
| 2297 | 2298 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-07 04:00:00 | 352.0 |
| 2835 | 2836 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 2969 | 2970 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 3332 | 3333 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-24 04:00:00 | 352.0 |
| 4056 | 4057 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-28 04:00:00 | 352.0 |
| 4646 | 4647 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-02 04:00:00 | 352.0 |
| 4868 | 4869 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-22 04:00:00 | 352.0 |
| 4882 | 4883 | 42 | 607 | 704.0 | 2 | credit_card | 2017-03-25 04:00:00 | 352.0 |
df_corrected['order_amount'].describe()
count 5000.000000 mean 303.854400 std 162.181186 min 90.000000 25% 163.000000 50% 284.000000 75% 390.000000 max 1760.000000 Name: order_amount, dtype: float64
df_corrected['order_amount'].median()
284.0
df_corrected.to_csv('Fall_2022_Data_Science_Intern_Challenge_Corrected_Data_Set.csv',index=False)
plt.figure(figsize=(16,4), dpi=200)
sns.lineplot(x=df_corrected['shop_id'],y=df_corrected['order_amount'])
<AxesSubplot:xlabel='shop_id', ylabel='order_amount'>
fig = px.scatter(
df_corrected,
x=df_corrected['shop_id'],
y=df_corrected['order_amount'],
width=950,
height=350,
color=df_corrected['order_amount']
)
fig.update_layout(
xaxis_title='Shop ID',
yaxis_title='Order Amount',
title='Price Analysis'
)
fig.show()
SELECT c.Country, od.ProductID, p.ProductName, SUM(od.Quantity) AS Total_Order
FROM Customers c, OrderDetails od, Orders o, Products p
WHERE c.CustomerID = o.CustomerID
AND od.ProductID = p.ProductID
AND o.OrderID = od.OrderID
AND c.Country = 'Germany'
GROUP BY ProductName
ORDER BY Total_Order DESC
LIMIT 1